ABSTRACT

User representation learning is vital for capturing diverse user
preferences, yet it is challenging because user intents are latent and
scattered across complex and different modalities of user-generated
data, and thus are not directly measurable. Inspired by the concept of
user schema in social psychology, we take a new perspective on user
representation learning by constructing a shared latent space to
capture the dependency among different modalities of user-generated
data. Both users and topics are embedded in the same space to encode
users’ social connections and text content, which facilitates joint
modeling of different modalities via a probabilistic generative
framework. We evaluated the proposed solution on large collections of
Yelp reviews and StackOverflow discussion posts, with their associated
network structures. The proposed model outperformed several
state-of-the-art topic modeling based user models with better
predictive power on unseen documents, and state-of-the-art network
embedding based user models with improved link prediction quality on
unseen nodes. The learnt user representations also prove useful in
content recommendation, e.g., expert finding in StackOverflow.
1 INTRODUCTION

Inferring user intent from recorded user behavior data has been
studied extensively for user modeling [11, 22, 31, 39, 40].
Essentially, user modeling builds up conceptual representations of
users, which help automated systems to better capture users’ needs and
enhance user experience in such systems [9, 17]. The rapid development
of social media enables users to participate in online activities and
create vast amounts of observational data, such as social interactions
[15, 16] and opinionated text content [8, 12, 27], which in turn
provide informative signs about user intents and enable more accurate
user representation learning. Extensive efforts have proved the value
of user representation learning in various real-world applications,
such as latent factor models for collaborative filtering [18, 29],
topic models for content modeling [23, 38], network embedding models
for social link prediction [5, 20], and many more [31, 42].

User representation learning is challenging, and it can never be a
straightforward application of existing statistical learning
algorithms to user-generated data. First, user-generated data is
noisy, incomplete, highly unstructured, and tied to social
interactions [34], which imposes serious challenges in modeling such
data. For example, in an environment where users are connected, e.g.,
a social network, user-generated data is potentially related, which
directly breaks the popularly imposed independent and identically
distributed assumption in most learning solutions [10, 20, 32].
Second, users often participate in various online activities
simultaneously, which creates instrumental contextual signals across
different modalities. Although oftentimes scattered and sparse, such
multi-modal observations reflect users’ underlying intents as a whole
and call for a holistic modeling approach [19]. Ad-hoc data-driven
solutions inevitably isolate the dependency and hence fail to create
a comprehensive representation of users. For example, users’ social
interactions [5, 28] and their generated text data [4, 23, 38] have
been extensively studied for user representation learning, but they
are largely modeled in isolation. Third, as a consequence, a unified
user representation learning solution is preferred to serve different
applications, by taking advantage of data-rich applications to help
data-poor ones.

Even among the few attempts at joint modeling of different types of
user-generated data [12, 43], explicit modeling of dependency among
multiple behavior modalities is still missing. For example, Yang et
al. [43] incorporated user-generated text content into network
representation learning via joint matrix factorization. In their
solution, content modeling is only used as a regularization for
network modeling, and thus the learnt model is not in a position to
predict unseen text content. Gong and Wang [12] paired the task of
sentiment classification with that of social network modeling, and
represented each user as a mixture over the instances of these paired
tasks. Though text and network are jointly considered, they are only
correlated by sharing the same mixing component, without explicit
modeling of the mutual influence between them.



In social psychology and cognitive science, the concept of user
schema defines the knowledge structure a person holds, which
organizes categories of information and the relationships among such
categories [37]. Putting it into the scenario of user modeling, we
naturally interpret the knowledge structure as a user representation
described by the collection of associated data, such as the set of
textual reviews and behavioral logs associated with individual users.
The interrelation existing among multiple types of data further
motivates us to perform user modeling in a joint manner, while the
concept of distributed representation learning [2], i.e., embedding,
provides us with one possible solution. By constructing a shared
latent space, we can embed different modalities of user-generated
data in the same low-dimensional space, where the structural
dependency among them can be realized by the proximity among
different embeddings. The space should be constructed in such a way
that: 1) the properties of each modality of user-generated data are
preserved; 2) the closeness among different modalities of
user-generated data can be characterized by the similarity measured
in the latent space. For example, connected users in a social network
should be closer to each other in this latent space; and by mapping
other types of user behavior data into this same space, e.g., text
data or behavioral logs, users should be surrounded by their own
generated data.

To realize this new perspective of user representation learning, we
exploit the two most widely available and representative forms of
user-generated data, i.e., text content and social interactions. We
develop a probabilistic generative model to integrate user modeling
with content and network embedding. Due to the unstructured nature of
text, we appeal to statistical topic models to model user-generated
text content [4, 38], with the goal of capturing the underlying
semantics. We define a topic as a probability distribution over a
fixed vocabulary [4]. We embed both users and topics in the same
low-dimensional space to capture their mutual dependency. On one
hand, a user’s affinity to a topic is characterized by his/her
proximity to the topic’s embedding in this latent space, which is
utilized to generate each text document of the user. On the other
hand, the affinity between users is directly modeled by the proximity
between users’ embeddings, which are utilized to generate the
corresponding social network connections. In this latent space, the
two modalities of user-generated data are correlated explicitly, as
indicated by the user’s topical preferences. The user representation
is obtained by posterior inference of those embedding vectors over a
set of training data, via variational Bayesian inference. To reflect
the nature of our proposed user representation learning method, we
name the solution Joint Network Embedding and Topic Embedding, or
JNET for short.

Extensive empirical evaluations are performed on two large
collections of user-generated text documents from Yelp and
StackOverflow, together with their network structures. Compared with
a set of state-of-the-art user representation learning solutions,
clear advantages of JNET are observed: the model’s predictive power
in content modeling is enhanced on users with rich social
connections, and a similar improvement is observed in its network
modeling predictions on users with rich text data. The use of the
learnt user representation generalizes beyond content modeling and
social network modeling: it accurately suggests technical discussion
threads for users to participate in on StackOverflow, e.g., expert
recommendation.


2 RELATED WORK 

In order to learn effective user representations, a lot of effort has
been devoted to modeling diverse modalities of user-generated data:
1) in an isolated manner, i.e., focusing on one particular modality
of user-generated data such as text reviews; 2) in a joint manner,
i.e., utilizing multiple types of user data. Our proposed solution
falls into the second category, as it learns user representations
from both network structure and text content by explicitly capturing
the dependency between the two modalities in the latent topic space.

When performing user representation learning in an isolated way, much
attention has been paid to exploring user-user interactions to learn
users’ distributed representations, which are essential for better
understanding users’ interactive preferences in social network
analysis. Inspired by word embedding techniques [25], random walk
models are exploited to generate random paths over a network to learn
dense, continuous and low-dimensional representations of users
[13, 28, 35]. Matrix factorization is also commonly used to learn
user embeddings [26, 41], as learning a low-rank space for an
adjacency matrix representing the network naturally fits the need of
learning low-rank user/node embeddings. For instance, Tang and Liu
[36] factorize an input network’s modularity matrix and use
discriminative training to extract representative dimensions for
learning user representation.

In parallel, user-generated text data is utilized to understand
users’ emphasis on specific entities or aspects. Topic models [4, 14]
serve as a building block for statistical modeling of text data.
Typical solutions model individual users as a bag of topics [30],
which govern the generation of associated text documents. Wang and
Blei [38] combine topic modeling with collaborative filtering to
estimate topical user representations with additional observations
from user-item ratings. Wang et al. [39] use topic modeling to
estimate users’ detailed aspect-level preferences from their
opinionated review content. Lin et al. [21] learn users’ personalized
topical compositions to differentiate users’ subjectivity from items’
intrinsic properties in the review documents. McAuley and Leskovec
[23] uncover the implicit preferences of each user as well as the
properties of each product by mapping users and items into a shared
topic space. Some recent works use deep neural networks to obtain
user embeddings from their generated text data [7, 33].

Most previous works studied social networks and user-generated text
content in isolation; little attention has been paid to combining the
two sources for better user modeling. Earlier work [24] regularizes a
statistical topic model with a harmonic regularizer defined on the
network structure. Yang et al. [43] incorporate text features of
users into network representation learning via joint matrix
factorization. Gong and Wang [12] pair tasks of opinionated content
modeling and network structure modeling in a group-wise fashion, and
model each user as a mixture over the tasks. Though both text and
network are utilized for user modeling in the aforementioned works,
explicit modeling of dependence among different modalities is still
missing. Archarya et al. [1] explore the dependency among documents
and network, but on a per-community basis instead of a per-user
basis. Our work proposes a holistic view to model users’ social
preferences and topical interests jointly, and thus provides a more
general understanding of user intents from multiple perspectives.



3 JOINT NETWORK EMBEDDING AND TOPIC EMBEDDING

To interrelate different modalities of user-generated data for user
representation learning, we propose to perform joint network
embedding and topic embedding. In this section, we first provide the
details of our probabilistic generative model, JNET, which imposes a
complete generative process over the user-generated social
interactions and text data of each individual user. Then we describe
our variational Bayesian based Expectation Maximization algorithm,
which retrieves the learnt user representation from a given corpus.

3.1 Model Specification 

We denote a collection of $U$ users as
$\mathcal{U} = \{u_1, u_2, \ldots, u_U\}$, in which each user $u_i$
is associated with a set of documents
$\mathcal{D}_i = \{x_{i,d}\}_{d=1}^{D_i}$. Each document is
represented as a bag of words $x_{i,d} = \{w_1, w_2, \ldots, w_N\}$,
where each $w_n$ is chosen from a vocabulary of size $V$. Each user
is also associated with a set of social connections denoted as
$\mathcal{E}_i = \{e_{ij}\}_{j \neq i}^{U}$, where $e_{ij} = 1$
indicates that users $u_i$ and $u_j$ are connected in the network;
otherwise, $e_{ij} = 0$.

We represent each user as a real-valued continuous vector
$u_i \in \mathbb{R}^M$ in a low-dimensional space. We seek to impose
a joint distribution over the observations in each user’s associated
text documents and social interactions, so as to capture the
underlying structural dependency between these two types of data.
Based on our assumption that both types of user-generated data are
governed by the same underlying user intent, we explicitly model the
joint distribution as
$p(\mathcal{D}_i, \mathcal{E}_i) = \int p(\mathcal{D}_i, \mathcal{E}_i, u_i)\, du_i$,
which can be further decomposed into
$p(\mathcal{D}_i, \mathcal{E}_i, u_i) = p(\mathcal{D}_i | \mathcal{E}_i, u_i)\, p(\mathcal{E}_i | u_i)\, p(u_i)$.
We assume that, given the user representation $u_i$, the generation
of text documents in $\mathcal{D}_i$ is independent of the generation
of social interactions in $\mathcal{E}_i$, i.e.,
$p(\mathcal{D}_i | \mathcal{E}_i, u_i) = p(\mathcal{D}_i | u_i)$. As
a result, the modeling of the joint probability over a user’s
observational data with his/her latent representation can be
decomposed into three related modeling tasks: 1)
$p(\mathcal{D}_i | u_i)$ for content modeling, 2)
$p(\mathcal{E}_i | u_i)$ for social connection modeling, and 3)
$p(u_i)$ for user embedding modeling.

We appeal to topic models [4,14] due to their effectiveness shown 
in existing empirical studies for content modeling. The concept of 
user schema inspires us to embed both users and topics to the same 
latent space in order to capture the dependency between them. By 
projecting a user’s embedding vector to topic embedding vectors, 
we can easily measure affinity between a user and a topic, and thus 
capture users’ topical preferences. It also allows us to capture the 
topical variance in documents from the same user and establish a 
valid predictive distribution of his/her documents. 

Formally, we assume there are in total $K$ topics underlying the
corpus, each represented as an embedding vector
$\phi_k \in \mathbb{R}^M$ in the same latent space; denote
$\Phi \in \mathbb{R}^{K \times M}$ as the matrix of topic embeddings,
which facilitates our representation of each user’s affinity towards
different topics: $\Phi \cdot u_i$, which reflects user $u_i$’s
topical preferences and serves as the prior of the topic distribution
in each text document from him/her. Specifically, denote the
document-level topic vector as $\theta_{id} \in \mathbb{R}^K$; we
have $\theta_{id} \sim N(\Phi \cdot u_i, \tau^{-1} I)$, where $\tau$
characterizes the uncertainty when user $u_i$ is choosing topics from
his/her global topic preferences for each single document. By
projecting the document-level topic vector into a probability
simplex, we obtain the topic distribution for document $x_{i,d}$:
$\pi_{idk} = \mathrm{softmax}(\theta_{idk}) = \exp(\theta_{idk}) / \sum_{l=1}^{K} \exp(\theta_{idl})$,
from which we sample a topic indicator $z_{idn} \in \{1, \ldots, K\}$
for each word $w_{idn}$ in $x_{i,d}$ by
$z_{idn} \sim \mathrm{Multi}(\pi_{id})$. As in conventional topic
models, each topic $k$ is also associated with a multinomial
distribution $\beta_k$ over a fixed vocabulary, and each word
$w_{idn}$ is then drawn from the respective word distribution
indicated by the corresponding topic assignment, i.e.,
$w_{idn} \sim p(w | \beta_{z_{idn}})$. Putting all pieces together,
the task of content modeling for each user can be summarized as

$$p(\mathcal{D}_i | u_i) = \prod_{d=1}^{D_i} p(\theta_{id} | u_i, \Phi, \tau) \prod_{n=1}^{N} p(z_{idn} | \theta_{id})\, p(w_{idn} | z_{idn}, \beta).$$

Figure 1: Graphical model representation of JNET. The upper plate
indexed by K denotes the learnt topic embeddings. The outer plate
indexed by U denotes distinct users in the collection. The inner
plates indexed by U and D denote each user’s social connections and
text documents respectively. The inner plate indexed by N denotes the
word content in one text document.

The key in modeling social connections is to understand the closeness
among users. As we represent users with real-valued continuous
vectors, this can be easily measured by the vector inner product in
the learnt low-dimensional space. Define the underlying affinity
between a pair of users $u_i$ and $u_j$ as $\delta_{ij}$; we assume
$E[\delta_{ij}] = u_i^{T} u_j$. To capture the uncertainty of the
affinity between different pairs of users, we further assume
$\delta_{ij}$ is drawn from a Gaussian distribution centered at the
measured closeness, $\delta_{ij} \sim N(u_i^{T} u_j, \xi^2)$, where
$\xi$ characterizes the concentration of this distribution. The
observed social connection $e_{ij}$ between users $u_i$ and $u_j$ is
then assumed to be a realization of this underlying user affinity:
$e_{ij} \sim \mathrm{Bernoulli}(\mathrm{logistic}(\delta_{ij}))$,
where $\mathrm{logistic}(\delta_{ij}) = 1 / (1 + \exp(-\delta_{ij}))$.
As a result, the task of social connection modeling can be achieved by
$p(\mathcal{E}_i | u_i) = \prod_{j \neq i}^{U} p(e_{ij} | \delta_{ij})\, p(\delta_{ij} | u_i, u_j)$.

We do not have any specific constraint on the form of the latent user
embedding vectors $\{u_i\}_{i=1}^{U}$ and topic embedding vectors
$\{\phi_k\}_{k=1}^{K}$, as long as they are in an $M$-dimensional
space. For simplicity, we assume they are drawn from isotropic
Gaussian distributions respectively, i.e.,
$u_i \sim N(0, \gamma^{-1} I)$, where $\gamma$ measures the
concentration of different users’ embedding vectors, and
$\phi_k \sim N(0, \alpha^{-1} I)$. Other types of prior distribution
can also be introduced if one has more knowledge about the user and
topic embeddings, such as sparsity or a particular geometric shape.
But it is generally preferred to have conjugate priors, so as to
simplify later posterior inference steps.

Putting these components together, the generative process of 
our solution can be described as follows: 

• For each topic $\phi_k$:

- Draw its topic compact representation $\phi_k \sim N(0, \alpha^{-1} I)$

• For each user $u_i$:

- Draw its user compact representation $u_i \sim N(0, \gamma^{-1} I)$

- For every other user $u_j$:

* Draw affinity $\delta_{ij}$ between $u_i$ and $u_j$, $\delta_{ij} \sim N(u_i^{T} u_j, \xi^2)$

* Draw interaction $e_{ij}$ between $u_i$ and $u_j$, $e_{ij} \sim \mathrm{Bernoulli}(\mathrm{logistic}(\delta_{ij}))$

• For each document of user $u_i$:

- Draw the user-document topic preference vector $\theta_{id} \sim N(\Phi \cdot u_i, \tau^{-1} I)$

- For each word $w_{idn}$:

* Draw topic assignment $z_{idn} \sim \mathrm{Multi}(\mathrm{softmax}(\theta_{id}))$

* Draw word $w_{idn} \sim \mathrm{Multi}(\beta_{z_{idn}})$

We make two explicit assumptions here: 1) the dimensionality $M$ of
the compact representation of topics and users is predefined and
fixed; 2) the word distributions under topics are parameterized by a
$K \times V$ matrix $\beta$, where $\beta_{kv} = p(w_v | z_k)$ over a
fixed vocabulary of size $V$. The generative model captures the
interrelation between multiple modalities of user-generated data for
user representation learning. In essence, we are performing a Joint
Network Embedding and Topic Embedding; thus, we name the resulting
model JNET for short.
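
The generative process above can be sketched as a small forward
sampler. This is only an illustrative sketch under stated
assumptions, not the authors' implementation: the function name
`sample_jnet`, its argument names, and the random initialization of
the word distributions are all hypothetical, and every document is
given the same length for brevity.

```python
import math
import random


def softmax(xs):
    """Project a real-valued vector onto the probability simplex."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def logistic(x):
    """Numerically stable logistic function 1 / (1 + exp(-x))."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sample_jnet(U, K, M, V, docs_per_user, words_per_doc,
                alpha=1.0, gamma=1.0, tau=1.0, xi=1.0, seed=0):
    """Draw one synthetic corpus and network from the generative story."""
    rng = random.Random(seed)

    def gauss_vec(precision):
        # Isotropic Gaussian N(0, precision^{-1} I) in M dimensions.
        return [rng.gauss(0.0, precision ** -0.5) for _ in range(M)]

    phi = [gauss_vec(alpha) for _ in range(K)]    # topic embeddings
    users = [gauss_vec(gamma) for _ in range(U)]  # user embeddings

    # Assumption: beta is a model parameter in the text; here each row
    # is simply a random normalized vector for illustration.
    beta = []
    for _ in range(K):
        row = [rng.random() + 1e-3 for _ in range(V)]
        s = sum(row)
        beta.append([r / s for r in row])

    edges, docs = {}, {}
    for i in range(U):
        for j in range(U):
            if i == j:
                continue
            # Affinity delta_ij ~ N(u_i^T u_j, xi^2), then a Bernoulli link.
            delta = rng.gauss(dot(users[i], users[j]), xi)
            edges[(i, j)] = 1 if rng.random() < logistic(delta) else 0
        docs[i] = []
        for _ in range(docs_per_user):
            # theta_id ~ N(Phi u_i, tau^{-1} I), then softmax projection.
            theta = [rng.gauss(dot(phi[k], users[i]), tau ** -0.5)
                     for k in range(K)]
            pi = softmax(theta)
            words = []
            for _ in range(words_per_doc):
                z = rng.choices(range(K), weights=pi)[0]
                words.append(rng.choices(range(V), weights=beta[z])[0])
            docs[i].append(words)
    return phi, users, edges, docs
```

Sampling from the model in this way is mainly useful as a sanity
check: an inference routine run on such synthetic data should roughly
recover which users are close to which topics.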

3.2 Variational Bayesian Inference 

The compact user representations can be obtained via posterior
inference over the latent variables on a given set of data. However,
posterior inference is not analytically tractable in JNET due to the
coupling among latent variables, i.e., user-user affinity $\delta$,
user embedding $u$, topic embedding $\Phi$, document-level topic
proportion $\theta$ and word-level topic assignment $z$. We appeal to
a mean-field variational method to approximate the posterior
distributions, and further utilize Taylor expansion [3] to address
the difficulty introduced by non-conjugate logistic-normal priors.

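We begin by postulating a factorized distribution:

$$q(\Phi, U, \Delta, \Theta, Z) = \prod_{k=1}^{K} q(\phi_k) \prod_{i=1}^{U} q(u_i) \Big[ \prod_{j \neq i}^{U} q(\delta_{ij}) \prod_{d=1}^{D_i} q(\theta_{id}) \prod_{n=1}^{N} q(z_{idn}) \Big]$$

where the factors have the following parametric forms:

$q(\phi_k) = N(\phi_k | \mu^{(\phi_k)}, \Sigma^{(\phi_k)})$, $q(u_i) = N(u_i | \mu^{(u_i)}, \Sigma^{(u_i)})$,

$q(\delta_{ij}) = N(\delta_{ij} | \mu^{(\delta_{ij})}, \sigma^{(\delta_{ij})2})$, $q(\theta_{id}) = N(\theta_{id} | \mu^{(\theta_{id})}, \Sigma^{(\theta_{id})})$,

$q(z_{idn}) = \mathrm{Mult}(z_{idn} | \eta_{idn})$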

Because the topic proportion vector $\theta_{id}$ is inferred in each
document, it is not necessary to estimate a full covariance matrix
for it [3]. Hence, in its variational distribution, we only estimate
the diagonal variance parameters.

Variational algorithms aim to minimize the KL divergence from 
the approximated posterior distribution q to the true posterior 
distribution p. It is equivalent to tightening the evidence lower 
bound (ELBO) by Jensen’s inequality [4]: 

$$\log p(w, e | \alpha, \beta, \gamma, \tau) \geq E_q[\log p(U, \Theta, Z, \Phi, \Delta, w, e | \alpha, \beta, \gamma, \tau)] - E_q[\log q(U, \Theta, Z, \Phi, \Delta)] \quad (1)$$

where the expectation is taken with respect to the factorized
variational distribution of the latent variables
$q(\Phi, U, \Delta, \Theta, Z)$. Let $\mathcal{L}(q)$ denote the
right-hand side of Eq (1). The first step of maximizing this lower
bound is to derive the analytic form of the posterior expectations
required in $\mathcal{L}(q)$. Thanks to the conjugate priors
introduced on $\{u_i\}_{i=1}^{U}$ and
$\Phi = \{\phi_k\}_{k=1}^{K}$, the expectations related to these
latent variables have closed form solutions, while due to
non-conjugate logistic-normal priors, we use Taylor expansions to
approximate the expectations related to $\theta_{id}$ and
$\delta_{ij}$. Next we describe the detailed inference procedure for
each latent variable.

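• Estimate topic embedding. For each topic $k$, we relate the terms
associated with $q(\phi_k | \mu^{(\phi_k)}, \Sigma^{(\phi_k)})$ in
Eq (1) and take maximization w.r.t. $\mu^{(\phi_k)}$ and
$\Sigma^{(\phi_k)}$. Closed form estimations of $\mu^{(\phi_k)}$ and
$\Sigma^{(\phi_k)}$ exist,

$$\mu^{(\phi_k)} = \tau \Sigma^{(\phi_k)} \sum_{i=1}^{U} \sum_{d=1}^{D_i} \mu_k^{(\theta_{id})} \mu^{(u_i)}$$

$$\Sigma^{(\phi_k)} = \Big[ \alpha I + \tau \sum_{i=1}^{U} D_i \big( \Sigma^{(u_i)} + \mu^{(u_i)} \mu^{(u_i)T} \big) \Big]^{-1} \quad (2)$$
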


The estimation of $\Sigma^{(\phi_k)}$ is not related to a specific
topic $k$, because we impose an isotropic Gaussian prior for all
$\{\phi_k\}_{k=1}^{K}$ in JNET. It suggests that the correlations
between different topic embedding dimensions are homogeneous across
topics. Interestingly, we can notice that the posterior covariance
$\Sigma^{(\phi_k)}$ of topic embeddings is closely related to user
embeddings, which indicates direct dependency from network structure
to text content.
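
A short derivation sketch may help show why the posterior covariance
is shared across topics. It collects the terms of the lower bound
that involve $\phi_k$ (the Gaussian prior and the Gaussian likelihood
of the $k$-th entry of each document-level topic vector), using the
mean-field moment $E_q[u_i u_i^{T}] = \Sigma^{(u_i)} + \mu^{(u_i)}\mu^{(u_i)T}$.
The notation $\mathcal{L}_{[\phi_k]}$ for this collection of terms is
introduced here only for illustration.

```latex
\begin{aligned}
\mathcal{L}_{[\phi_k]}
 &= -\frac{\alpha}{2}\,\phi_k^{T}\phi_k
    -\frac{\tau}{2}\sum_{i=1}^{U}\sum_{d=1}^{D_i}
      E_q\!\left[(\theta_{idk}-\phi_k^{T}u_i)^2\right] + \text{const} \\
 &= -\frac{1}{2}\,\phi_k^{T}
    \Big[\alpha I + \tau\sum_{i=1}^{U} D_i
      \big(\Sigma^{(u_i)}+\mu^{(u_i)}\mu^{(u_i)T}\big)\Big]\phi_k
    + \tau\,\phi_k^{T}\sum_{i=1}^{U}\sum_{d=1}^{D_i}
      \mu^{(\theta_{id})}_k\,\mu^{(u_i)} + \text{const}
\end{aligned}
```

Completing the square identifies the bracketed matrix as the
posterior precision of $\phi_k$. It contains only user-level
quantities and no term indexed by $k$, so every topic receives the
same covariance, while the linear term carries the topic-specific
information into the posterior mean.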

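• Estimate user embedding. For each user $i$, we relate the terms
associated with $q(u_i | \mu^{(u_i)}, \Sigma^{(u_i)})$ in Eq (1) and
maximize it with respect to $\mu^{(u_i)}$ and $\Sigma^{(u_i)}$.
Closed form estimations can also be achieved for these two parameters
as follows:

$$\mu^{(u_i)} = \Sigma^{(u_i)} \Big( \tau \sum_{d=1}^{D_i} \sum_{k=1}^{K} \mu_k^{(\theta_{id})} \mu^{(\phi_k)} + \xi^{-2} \sum_{j \neq i}^{U} \mu^{(\delta_{ij})} \mu^{(u_j)} \Big)$$

$$\Sigma^{(u_i)} = \Big[ \gamma I + \tau D_i \sum_{k=1}^{K} \big( \Sigma^{(\phi_k)} + \mu^{(\phi_k)} \mu^{(\phi_k)T} \big) + \xi^{-2} \sum_{j \neq i}^{U} \big( \Sigma^{(u_j)} + \mu^{(u_j)} \mu^{(u_j)T} \big) \Big]^{-1} \quad (3)$$
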

The effect of joint content modeling and network modeling for user
representation learning is clearly depicted in this posterior
estimation of user embedding vectors. The updates of $\mu^{(u_i)}$
and $\Sigma^{(u_i)}$ come from two types of influence: the text
content and the social interactions of the current user. For example,
the posterior mode estimation of user embedding vector $u_i$ is a
weighted average over the topic vectors that this user has used in
his/her past text documents and the user vectors of his/her friends,
where the weights measure his/her affinity to those topics and users
in each specific observation. The updates exactly reflect the
formation of “user schema” in social psychology from two
perspectives: both modalities of user-generated data shape user
embeddings, while the structural dependency between them is reflected
in this unified user representation.

• Estimate per-document topic proportion vector. Similar procedures
as above can be taken to estimate $\mu^{(\theta_{id})}$ and
$\Sigma^{(\theta_{id})}$. Due to the lack of a conjugate prior for
logistic Normal distributions, we apply Taylor expansion and
introduce an additional free variational parameter $\zeta$ in each
document. Because there is no closed form solution for the resulting
optimization problem, we use gradient ascent to optimize
$\mu^{(\theta_{id})}$ and $\Sigma^{(\theta_{id})}$ with the following
gradients,

$$\partial L / \partial \mu_k^{(\theta_{id})} = -\tau \mu_k^{(\theta_{id})} + \tau \mu^{(\phi_k)T} \mu^{(u_i)} + \sum_{n=1}^{N} \big[ \eta_{idnk} - \zeta^{-1} \exp\big( \mu_k^{(\theta_{id})} + \Sigma_{kk}^{(\theta_{id})} / 2 \big) \big] \quad (4)$$

$$\partial L / \partial \Sigma_{kk}^{(\theta_{id})} = -\tau - N \exp\big( \mu_k^{(\theta_{id})} + \Sigma_{kk}^{(\theta_{id})} / 2 \big) / \zeta + 1 / \Sigma_{kk}^{(\theta_{id})}$$

where $\zeta = \sum_{k=1}^{K} \exp\big( \mu_k^{(\theta_{id})} + \Sigma_{kk}^{(\theta_{id})} / 2 \big)$.
Since only the diagonal elements in $\Sigma^{(\theta_{id})}$ are
statistically meaningful (i.e., variance), we simply set its
off-diagonal elements to zero in the gradient update. The gradient
function suggests that the document-level topic proportion vector
should align with the corresponding compact user representation and
topic representation. Although no closed form estimations of
$\mu^{(\theta_{id})}$ and $\Sigma^{(\theta_{id})}$ exist, the
expected property of $\mu^{(\theta_{id})}$ is clearly reflected: the
proportion of each topic in document $x_{i,d}$ should align with this
user’s preference on this topic (i.e., affinity in the embedding
space) and the topic assignment in the document content. The variance
is introduced by the uncertainty of the per-word topic choice and the
intrinsic uncertainty of a user’s affinity with a topic.
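
To make the mean update concrete, the following is a minimal sketch
of gradient ascent on the document-level mean only, with the variance
held fixed for brevity. It is written under stated assumptions rather
than taken from the authors' code: `theta_objective` collects only
the terms of the lower bound that depend on the mean (as reconstructed
here), `prior_mean[k]` stands for the inner product of the topic and
user posterior means, `eta_sum[k]` for the per-document sum of word
topic responsibilities, and the step size `1 / (tau + n_words)` is a
conservative choice made for this example.

```python
import math


def theta_objective(mu, var, prior_mean, eta_sum, n_words, tau, zeta):
    """Mean/variance-dependent terms of the bound for one document."""
    total = 0.0
    for k in range(len(mu)):
        total += -0.5 * tau * ((mu[k] - prior_mean[k]) ** 2 + var[k])
        total += eta_sum[k] * mu[k]
        total += -n_words * math.exp(mu[k] + var[k] / 2.0) / zeta
        total += 0.5 * math.log(var[k])
    return total


def grad_mu(mu, var, prior_mean, eta_sum, n_words, tau, zeta):
    """Gradient with respect to the mean, matching the form of Eq (4)."""
    return [-tau * (mu[k] - prior_mean[k]) + eta_sum[k]
            - n_words * math.exp(mu[k] + var[k] / 2.0) / zeta
            for k in range(len(mu))]


def ascend_mu(mu, var, prior_mean, eta_sum, n_words, tau, iters=200):
    """Plain gradient ascent on the mean, refreshing zeta each step."""
    mu = list(mu)
    step = 1.0 / (tau + n_words)  # assumed step size, kept small for stability
    for _ in range(iters):
        zeta = sum(math.exp(m + v / 2.0) for m, v in zip(mu, var))
        g = grad_mu(mu, var, prior_mean, eta_sum, n_words, tau, zeta)
        mu = [m + step * gk for m, gk in zip(mu, g)]
    return mu
```

At the fixed point the pull towards the user-topic affinity, the
evidence from word assignments and the normalization term balance
out, which is the alignment property described in the text.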

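• Estimate user affinity. A similar approach can be applied here to
estimate $\mu^{(\delta_{ij})}$ and $\sigma^{(\delta_{ij})2}$, which
govern the latent user affinity. Again, gradient ascent is utilized
to optimize $\mu^{(\delta_{ij})}$ and $\sigma^{(\delta_{ij})}$:

$$\partial L / \partial \mu^{(\delta_{ij})} = e_{ij} - \epsilon^{-1} \exp\big( \mu^{(\delta_{ij})} + \sigma^{(\delta_{ij})2} / 2 \big) - \xi^{-2} \big( \mu^{(\delta_{ij})} - \mu^{(u_i)T} \mu^{(u_j)} \big)$$

$$\partial L / \partial \sigma^{(\delta_{ij})} = -\epsilon^{-1} \sigma^{(\delta_{ij})} \exp\big( \mu^{(\delta_{ij})} + \sigma^{(\delta_{ij})2} / 2 \big) - \xi^{-2} \sigma^{(\delta_{ij})} + 1 / \sigma^{(\delta_{ij})}$$

where $\epsilon$ is the free variational parameter introduced by the
Taylor expansion for this pair of users.
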

The gradient functions suggest that the latent affinity between a 
pair of users is closely related with their observed connectivity and 
their closeness in the embedding space. 
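
The affinity gradients can be written out for a single pair of users
as follows. This is a hedged sketch: the objective
`affinity_objective` is one reading of the pair-specific terms of the
lower bound that is consistent with the gradients described above,
and the argument `eps` (the free parameter from the Taylor expansion)
as well as all function names are assumptions made for illustration.

```python
import math


def affinity_objective(mu, sigma, e_ij, mean_ij, xi, eps):
    """Terms of the bound depending on (mu, sigma) for one user pair.

    mean_ij stands for the inner product of the two users' posterior
    mean embeddings; e_ij is the observed 0/1 connection.
    """
    return (e_ij * mu
            - math.exp(mu + sigma ** 2 / 2.0) / eps
            - ((mu - mean_ij) ** 2 + sigma ** 2) / (2.0 * xi ** 2)
            + math.log(sigma))


def affinity_grads(mu, sigma, e_ij, mean_ij, xi, eps):
    """Partial derivatives with respect to the mean and the std. dev."""
    expo = math.exp(mu + sigma ** 2 / 2.0)
    g_mu = e_ij - expo / eps - (mu - mean_ij) / xi ** 2
    g_sigma = -sigma * expo / eps - sigma / xi ** 2 + 1.0 / sigma
    return g_mu, g_sigma
```

An observed link (`e_ij = 1`) raises the gradient on the mean by
exactly one unit relative to a missing link, while the last term
pulls the mean towards the closeness of the two users in the
embedding space.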

• Estimate word topic assignment. The topic assignment $z_{idn}$ for
each word in document $x_{i,d}$ can be estimated by,

$$\eta_{idnk} \propto \exp\Big\{ \mu_k^{(\theta_{id})} + \sum_{v=1}^{V} w_{idnv} \log \beta_{kv} \Big\}$$
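
As a small illustration of this update, the sketch below computes the
normalized responsibilities for one word. Because the word indicator
is one-hot over the vocabulary, the inner sum reduces to the log
probability of the observed word under each topic. The function name
and argument layout are hypothetical.

```python
import math


def word_topic_posterior(mu_theta, beta, word_id):
    """Normalized topic responsibilities for a single observed word.

    mu_theta: posterior mean of the document-level topic vector (length K)
    beta:     K x V word distributions, with strictly positive entries
    word_id:  vocabulary index of the observed word
    """
    scores = [mu_theta[k] + math.log(beta[k][word_id])
              for k in range(len(mu_theta))]
    m = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]
```

For example, with two equally preferred topics and a word that has
probability 0.9 under the first and 0.1 under the second, the
responsibilities are 0.9 and 0.1.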

We execute the above variational inference procedures in an
alternating fashion until the lower bound $\mathcal{L}(q)$ defined in
Eq (1) converges. The variational inference algorithm postulates
strong independence structures between the variational parameters,
allowing straightforward parallel computing. Since the variational
parameters can be grouped by documents ($\mu^{(\theta_{id})}$,
$\Sigma^{(\theta_{id})}$ and $\eta$), by topics ($\mu^{(\phi_k)}$ and
$\Sigma^{(\phi_k)}$), and by users ($\mu^{(u_i)}$, $\Sigma^{(u_i)}$,
$\mu^{(\delta_{ij})}$ and $\sigma^{(\delta_{ij})2}$), we perform the
alternating updates in parallel to improve computational efficiency:
for example, we fix topic-level parameters and user-level parameters,
and distribute the documents across different machines to estimate
their own $\mu^{(\theta_{id})}$, $\Sigma^{(\theta_{id})}$ and $\eta$
in parallel for large collections of user-generated data.

3.3 Parameter Estimation 

When performing the variational inference described above, we have
assumed knowledge of the model parameters $\alpha$, $\gamma$, $\tau$,
$\xi$ and $\beta$. Based on the inferred posterior distribution of
latent variables in JNET, these model parameters can be readily
estimated by the Expectation-Maximization (EM) algorithm. The most
important model parameters are the priors for user embedding $\gamma$
and topic embedding $\alpha$, and the word-topic distribution
$\beta$. As $\xi$ and $\tau$ serve as the variance for user affinity
$\delta_{ij}$ and document topic proportion vector $\theta_{id}$, and
we have a large amount of observations in text documents and social
connections across all users, our model is less sensitive to their
settings. Therefore, we estimate $\xi$ and $\tau$ less frequently
than $\alpha$, $\gamma$ and $\beta$.

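By taking the gradient of $\mathcal{L}(q)$ in Eq (1) with respect to
$\alpha$ and setting the resulting gradient to 0, we get the closed
form estimation of $\alpha$ as follows:

$$\alpha = \frac{KM}{\sum_{k=1}^{K} \big[ \mathrm{tr}(\Sigma^{(\phi_k)}) + \mu^{(\phi_k)T} \mu^{(\phi_k)} \big]}$$

Similarly, the closed form estimation of $\gamma$ can be easily
derived as,

$$\gamma = \frac{UM}{\sum_{i=1}^{U} \big[ \mathrm{tr}(\Sigma^{(u_i)}) + \mu^{(u_i)T} \mu^{(u_i)} \big]}$$

And the closed form estimation for the word-topic distribution
$\beta$ can be achieved by,

$$\beta_{kv} \propto \sum_{i=1}^{U} \sum_{d=1}^{D_i} \sum_{n=1}^{N} w_{idnv}\, \eta_{idnk},$$

where $w_{idnv}$ indicates that the $n$th word in $u_i$’s $d$th
document is the $v$th term in the vocabulary. The estimation for
$\xi$ and $\tau$ is omitted due to space limits, but they can be
easily derived based on Eq (1).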
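
The closed form M-step updates lend themselves to a compact sketch.
The code below is illustrative only: the function names are
hypothetical, documents are passed as a flat list over all users, and
the small `smooth` constant is an addition made here to avoid
zero-probability words, not part of the update described in the text.

```python
def update_precision(means, covs):
    """Precision update: (count * M) / sum_k [tr(Sigma_k) + mu_k^T mu_k].

    Passing topic posteriors gives alpha; passing user posteriors
    gives gamma.
    """
    M = len(means[0])
    denom = 0.0
    for mu, cov in zip(means, covs):
        denom += sum(cov[m][m] for m in range(M))  # trace of the covariance
        denom += sum(x * x for x in mu)            # squared norm of the mean
    return len(means) * M / denom


def update_beta(docs, eta, K, V, smooth=1e-12):
    """Word-topic update: beta[k][v] proportional to summed responsibilities.

    docs: list of documents, each a list of vocabulary indices
    eta:  eta[d][n][k], responsibility of topic k for word n of document d
    """
    counts = [[smooth] * V for _ in range(K)]
    for d, words in enumerate(docs):
        for n, v in enumerate(words):
            for k in range(K):
                counts[k][v] += eta[d][n][k]
    return [[c / sum(row) for c in row] for row in counts]
```

Both functions only consume sufficient statistics collected during
inference, which is why the M-step is cheap relative to the E-step.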

The resulting EM algorithm consists of an E-step and an M-step. In
the E-step, the variational parameters are inferred based on the
procedures described in Section 3.2; and in the M-step, the model
parameters are estimated based on the sufficient statistics collected
from the E-step. These two steps are repeated until the lower bound
$\mathcal{L}(q)$ converges over all training data.

Inferring the latent variables associated with each user and each
topic is computationally cheap. Specifically, by Eq (2), updating the
variables for each topic imposes a complexity of $O(KM^2|D|)$, where
$K$ is the total number of topics, $M$ is the latent dimension, and
$|D|$ is the total number of documents. By Eq (3), updating the
variables for each user imposes a complexity of $O(M^2U^2)$, where
$U$ is the total number of users. Estimating the latent variables for
the per-document topic proportion imposes a complexity of
$O(|D|K(N + M))$ by Eq (4), where $N$ is the average document length.
Updating the variables for each pair of user affinity takes constant
time, while there are $U^2$ affinity variables. With the
consideration of the total number of users and topics, the overall
complexity for the proposed algorithm is $O(KM^2|D| + M^2U^2)$.
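
To give a feel for these terms, the sketch below plugs in the Yelp
statistics and settings reported in the experiments (40 topics, 10
latent dimensions, 187,737 documents, 10,830 users). It counts
operations only up to constant factors; the average document length
of 100 is a placeholder assumed here, and the function name is
hypothetical.

```python
def jnet_update_costs(K, M, num_docs, num_users, avg_len):
    """Leading-order operation counts per sweep, up to constant factors."""
    return {
        "topic": K * M * M * num_docs,            # O(K M^2 |D|)
        "user": M * M * num_users * num_users,    # O(M^2 U^2)
        "document": num_docs * K * (avg_len + M),  # O(|D| K (N + M))
        "affinity": num_users * num_users,        # U^2 constant-time updates
    }


costs = jnet_update_costs(K=40, M=10, num_docs=187_737,
                          num_users=10_830, avg_len=100)
print(costs["topic"])  # 750948000
print(costs["user"])   # 11728890000
```

Under these numbers the user-level term is roughly an order of
magnitude larger than the topic-level term, so the quadratic
dependence on the number of users is the part to watch when scaling
to larger networks.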

4 EXPERIMENTS 

We evaluated the proposed model on large collections of Yelp reviews
and StackOverflow forum discussions, together with their user network
structures. Qualitative analysis demonstrates the descriptive power
of JNET through direct mapping of user and topic embeddings into a
2-D space. The explicit modeling of dependency among user-generated
data confirms the effectiveness of JNET, as indicated by the model’s
predictive power in recovering missing links and modeling unseen
documents. The learnt user representation also enables accurate
content recommendation to users.

4.1 Experiment Settings 

Datasets. We employed two large publicly available user-generated
text datasets together with the associated user networks: 1) Yelp,
collected from the Yelp dataset challenge¹, consists of 187,737 Yelp
restaurant reviews generated by 10,830 users. The Yelp dataset
provides user friendship imported from their Facebook friend
connections. Among the whole set of users, 10,194 of them have
friends, with an average of 10.65 friends per user. 2) StackOverflow,
collected from Stackoverflow.com², consists of 244,360 forum
discussion posts generated by 10,808 users. While there is no
explicit network structure in the StackOverflow dataset, we utilized
the “reply-to” information in the discussion threads to build a user
network, because this relation suggests implicit social connections
among users based on their expertise and technical topic interest. We
ended up with 10,041 connected users, with an average of 5.55
connections per user. We selected 5,000 unigram and bigram text
features based on Document Frequency (DF) in both datasets. We
randomly split the data for 5-fold cross validation in all the
reported experiments.

¹ Yelp dataset challenge, http://www.yelp.com/dataset_challenge
² StackOverflow. http://stackoverflow.com

Figure 2: Visualization of user embedding and learnt topics in 2-D space of Yelp (left) and StackOverflow (right).

Baselines. We compared the proposed JNET model against a rich set of
user representation learning methods, including topic modeling based
solutions, network embedding methods, and models performing joint
modeling of text and network. 1) Latent Dirichlet Allocation (LDA)
[4] generates the topic distribution in documents across different
users, and the user representation is constructed by averaging the
posterior topic proportions of the documents associated with a user.
2) Relational Topic Model (RTM) [6] explicitly models the connection
between two documents, and we constructed a user-level network by
concatenating all documents of one user in this baseline. 3) Hidden
Factors and Hidden Topics (HFT) [23] combines latent rating
dimensions of users with latent review topics for user modeling.
Users’ “upvote” toward a question is utilized as a proxy of rating in
StackOverflow. 4) Collaborative Topic Regression (CTR) [38] combines
collaborative filtering with topic modeling to explain the observed
text content and ratings. 5) DeepWalk (DW) [28] takes truncated
random walks as input to learn social representations of vertices in
the network. 6) Text-Associated DeepWalk (TADW) [43] further
incorporates text content of vertices into network representation
learning under the framework of joint matrix factorization.

Parameter Settings. We set the latent dimension of user and
topic embeddings to 10 in both JNET and the baselines, as a larger
dimension gives limited performance improvement but slows down all
models considerably. As we tuned the number of topics from 10 to 100,
we found the learnt topics to be most representative and meaningful at
around 40 topics. Hence, we set the topic number to 40 in the reported
experiments. The maximum number of iterations in our EM algorithm
is set to 100. Both the source code and data are available
online 3 .

3 JNET. https://github.com/Linda-sunshme/JNET.

Figure 3: Perplexity comparison on Yelp and StackOverflow.

4.2 The Learnt User Representations

We first study the quality of the learnt user representations from
JNET. The learnt user embeddings are mapped to a 2-D space using
the t-SNE algorithm and are visualized in Figure 2. For illustration
purposes, we simply assign each user to the topic that he/she is
closest to, i.e., $\arg\max_k (\phi_k \cdot u_i)$, and we mark users
sharing the same topic of interest with the same color. We also plot
the most representative words of each topic learnt from JNET (i.e.,
$\arg\max_w p(w \mid \beta_z)$), in the same color as the
corresponding set of users.
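The coloring rule above assigns each user to the topic whose embedding has the largest inner product with the user's embedding. A minimal NumPy sketch of that rule follows; the array names and the toy values are illustrative assumptions, not taken from the released code:

```python
import numpy as np


def assign_users_to_topics(user_emb, topic_emb):
    """Assign each user to the topic with the largest inner product.

    user_emb:  (U, M) array, one row per user embedding.
    topic_emb: (K, M) array, one row per topic embedding.
    Returns a length-U integer array of topic indices.
    """
    scores = user_emb @ topic_emb.T   # (U, K) inner products
    return scores.argmax(axis=1)


# Toy example with M = 2 dimensions, U = 3 users and K = 2 topics.
user_emb = np.array([[1.0, 0.0], [0.0, 2.0], [0.5, 0.4]])
topic_emb = np.array([[1.0, 0.0], [0.0, 1.0]])
labels = assign_users_to_topics(user_emb, topic_emb)
```

The resulting labels can then be used as the color index when plotting the 2-D projection of the user embeddings.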

As we can see from the visualization of StackOverflow, users
of similar interests are clearly clustered in the 2-D space, which
indicates the descriptive power of our learnt user vectors. Meanwhile,
we can easily identify the theme of each learnt topic, such as
C++ (in light green circle), SQL (in dark purple circle) and Java (in
light blue circle). It is also interesting to find correlations among
the users and topics by looking into their distances. The users in dark
green are mainly interested in website development, and thus are far
away from the users who are interested in C++ (in light green). The
users in orange care more about network communication, and
they overlap with other clusters of users focusing on SQL (in
dark purple) and C++ (in light green), as network communication is
an important component across different programming languages.
Similar observations can also be made on the Yelp dataset.


4.3 Document Modeling 

Figure 4: Comparison of perplexity in cold-start users on Yelp.

In order to verify the predictive power of the proposed model, we
first evaluated the generalization quality of JNET on the document
modeling task. We compared all the topic model based solutions by
their perplexity on a held-out test set. Formally, the perplexity for a
set of held-out documents is calculated as follows [4]:

$$\mathrm{perplexity}(D_{test}) = \exp\left( - \frac{\sum_{d \in D_{test}} \log p(w_d)}{\sum_{d \in D_{test}} N_d} \right)$$

where $p(w_d)$ is the likelihood of each held-out document given by
a trained model, and $N_d$ is the number of words in document $d$.
A lower perplexity indicates better generalization quality of a model.
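Perplexity as defined in [4] is the exponentiated negative average per-word log-likelihood of the held-out documents. A minimal sketch, assuming the per-document log-likelihoods from a trained model are already available (the function and variable names are hypothetical):

```python
import math


def perplexity(doc_log_likelihoods, doc_lengths):
    """exp(- sum_d log p(w_d) / sum_d N_d) over a set of held-out documents."""
    total_log_likelihood = sum(doc_log_likelihoods)
    total_words = sum(doc_lengths)
    return math.exp(-total_log_likelihood / total_words)


# Sanity check: a model that assigns uniform probability 1/V to every word
# has perplexity V, regardless of the document lengths.
V = 5000
lengths = [12, 30]
log_likelihoods = [n * math.log(1.0 / V) for n in lengths]
ppl = perplexity(log_likelihoods, lengths)
```

The uniform-model check is a convenient reference point: any useful topic model should score well below the vocabulary size.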

Figure 3 reports the mean and variance of the perplexity for each
model under 5-fold cross validation over different topic sizes. JNET
achieved the best predictive power on the held-out data, especially
when an appropriate topic size is assigned. RTM achieved
comparable performance as it utilizes the connectivity information
among users, but it is limited by not being able to capture the
variance within each user’s different documents. The other baselines,
i.e., LDA, HFT and CTR, do not explicitly model network data and
therefore suffer in their performance.

A good joint model of network structure and text content
should let the two modalities complement each other to facilitate
more effective user representation learning. Hence, we expect a good
model to learn reasonable representations for users lacking text
information, a.k.a. cold-start users, by utilizing network structure.
We randomly selected 200 users and held out all their text content for
testing. According to the number of social connections each testing
user has in the training data, we further consider three different
sets of users, named light, medium and heavy users, to give a finer
analysis with respect to the degree of connectivity in the cold-start
setting. The thresholds for categorizing different sets of users are
based on the statistics of each dataset, and each group contains 200
users. In particular, we selected 5 and 20 as the connectivity
thresholds for Yelp, and 5 and 15 for StackOverflow. That is, in Yelp,
light users have fewer than 5 friends, medium users have more than 5
but fewer than 20 friends, and heavy users have more than 20 friends.
We compared JNET against four baselines, i.e., LDA, HFT, RTM and CTR.
We report the perplexity on the held-out test documents for the three
sets of users in Figure 4.
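The grouping of cold-start users by connectivity can be sketched as a simple thresholding step. The text does not say which group a user with exactly 5 or 20 connections falls into, so the sketch below assumes boundary values go to the higher group; the function name and toy data are hypothetical:

```python
def connectivity_group(num_connections, low, high):
    """Label a user as light, medium or heavy by number of connections.

    Assumption: a count equal to a threshold falls into the higher group,
    since the text leaves the boundary case unspecified.
    """
    if num_connections < low:
        return "light"
    if num_connections < high:
        return "medium"
    return "heavy"


YELP_THRESHOLDS = (5, 20)            # thresholds reported for Yelp
STACKOVERFLOW_THRESHOLDS = (5, 15)   # thresholds reported for StackOverflow

# Toy friend counts, not taken from the datasets.
friend_counts = {"user_a": 2, "user_b": 12, "user_c": 40}
groups = {
    user: connectivity_group(count, *YELP_THRESHOLDS)
    for user, count in friend_counts.items()
}
```

Each resulting group would then be subsampled to the 200 users used in the experiment.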

As we can observe in Figure 4, JNET performed consistently
better on the testing documents for the three different sets of unseen
users on the Yelp dataset, which indicates the advantage of utilizing
network information in addressing the cold-start content prediction
issue. The benefit of the network is further verified across different
sets of users, as heavily connected users achieve a larger performance
improvement over the text-only user representation model,
i.e., LDA. A similar conclusion is obtained on the StackOverflow
dataset, which we omit due to space limits.

4.4 Link Prediction 

The predictive power of JNET is not only reflected in unseen
documents, but also in missing links. In the task of link prediction,
the key component is to infer the similarity between users. We split
the observed social connections into 5 folds. Each time, we held
out one fold of edges for testing and utilized the rest for model
training, together with users’ text content. In order to construct a
valid set of ranking candidates for each testing user, we randomly
injected irrelevant users (non-friends) for evaluation purposes. The
number of irrelevant users is proportional to the number of
connections a testing user has, i.e., $t \times$ the number of social
connections. We rank users based on the cosine similarity between
their embedding vectors. Normalized discounted cumulative gain (NDCG)
and mean average precision (MAP) are used to measure the quality of
ranking. We started with the ratio between irrelevant users and
relevant users being t = 2 and increased the ratio to t = 8 to make
the task more challenging and to further verify the effectiveness of
the learnt user representations.
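The ranking protocol above can be sketched for a single testing user as follows. This is an illustrative implementation with binary relevance (friend or non-friend); the exact NDCG and MAP variants used in the experiments are not specified in the text, so the standard definitions are assumed, and all names and toy vectors are hypothetical:

```python
import numpy as np


def cosine_scores(query, candidates):
    """Cosine similarity between one query vector and each candidate row."""
    q = query / np.linalg.norm(query)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    return c @ q


def ndcg(ranked_relevance):
    """NDCG of a ranked list of binary relevance labels."""
    rel = np.asarray(ranked_relevance, dtype=float)
    discounts = 1.0 / np.log2(np.arange(2, rel.size + 2))
    dcg = float((rel * discounts).sum())
    ideal = float((np.sort(rel)[::-1] * discounts).sum())
    return dcg / ideal if ideal > 0 else 0.0


def average_precision(ranked_relevance):
    """Average precision of a ranked list of binary relevance labels."""
    rel = np.asarray(ranked_relevance, dtype=float)
    if rel.sum() == 0:
        return 0.0
    precision_at_k = np.cumsum(rel) / np.arange(1, rel.size + 1)
    return float((precision_at_k * rel).sum() / rel.sum())


# One testing user, two held-out friends and one injected non-friend.
user = np.array([1.0, 0.0])
candidates = np.array([[2.0, 0.1], [0.0, 1.0], [1.0, 1.0]])
is_friend = np.array([1, 0, 1])

order = np.argsort(-cosine_scores(user, candidates))  # best match first
ranked = is_friend[order]
```

MAP is then the mean of `average_precision` over all testing users, and the reported NDCG is the corresponding mean of `ndcg`.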

To compare the prediction performance, we tested five baselines,
i.e., LDA, HFT, RTM, DW and TADW. We report the NDCG and
MAP for the two datasets in Figure 5. JNET achieved encouraging
performance on both datasets, which indicates that effective
user representations are learnt to recover network structure. On the
Yelp dataset, network-only solutions, i.e., DW, and text-only
solutions, i.e., LDA and HFT, cannot take full advantage of both
modalities of user-generated data to capture user intents, while RTM
achieved decent performance due to its integration of content and
network modeling. Since the network in StackOverflow is constructed in
a more content-oriented way, link prediction on StackOverflow favors
the text based solutions, which explains the comparable performance of
LDA. Though TADW utilizes both modalities for user modeling, it fails
to capture the dependency between them, leading to its poor
performance on this task.

In practice, link prediction for unseen users is especially useful.
One example is friend recommendation for new users in a system, who
have very few or no friends but may be associated with
rich text content. This is also known as “cold-start” link prediction.
Network-only solutions suffer from the lack of information about
such users. However, a joint model can overcome this limitation by
utilizing user-generated text content to learn representative user
vectors, and thus provide helpful link prediction results.

In order to study the models’ predictive power in the cold-start
setting, we randomly sampled three sets of users according to
the number of documents each user has, and name them light,
medium and heavy users accordingly. Each set consists of
200 users, and we selected 10 and 50 as the thresholds for Yelp, and
15 and 50 for StackOverflow. For example, in StackOverflow, light
users have fewer than 15 posts, medium users have more than 15 but
fewer than 50 posts, and heavy users have more than 50 posts. We
compared JNET against four baselines, i.e., LDA, HFT, RTM and TADW.
Because DW cannot learn representations for users without any network
information, it is excluded from this experiment. We also randomly
injected irrelevant users as introduced before, and varied the ratio
to change the difficulty of the task. We report the NDCG and MAP
performance on the three sets of users in the StackOverflow dataset
with three different ratios, i.e., 2, 4 and 6, in Table 1.

Figure 5: The performance comparison of link suggestion on Yelp and StackOverflow.

Table 1: The performance comparison of link prediction for cold-start users on StackOverflow. Each cell reports NDCG/MAP.

Models | Light, Ratio=2 | Light, Ratio=4 | Light, Ratio=6 | Medium, Ratio=2 | Medium, Ratio=4 | Medium, Ratio=6 | Heavy, Ratio=2 | Heavy, Ratio=4 | Heavy, Ratio=6
-------|----------------|----------------|----------------|-----------------|-----------------|-----------------|----------------|----------------|---------------
LDA    | 0.786/0.648    | 0.664/0.477    | 0.632/0.431    | 0.774/0.597     | 0.677/0.451     | 0.612/0.364     | 0.818/0.581    | 0.745/0.443    | 0.697/0.366
HFT    | 0.666/0.493    | 0.543/0.333    | 0.483/0.259    | 0.671/0.461     | 0.562/0.313     | 0.492/0.226     | 0.682/0.389    | 0.591/0.250    | 0.532/0.179
RTM    | 0.777/0.642    | 0.688/0.514    | 0.627/0.433    | 0.801/0.638     | 0.709/0.495     | 0.654/0.419     | 0.837/0.624    | 0.760/0.481    | 0.711/0.399
TADW   | 0.695/0.525    | 0.583/0.373    | 0.515/0.291    | 0.696/0.481     | 0.591/0.336     | 0.532/0.263     | 0.739/0.448    | 0.639/0.298    | 0.587/0.229
JNET   | 0.794/0.664    | 0.697/0.534    | 0.643/0.453    | 0.812/0.649     | 0.724/0.511     | 0.663/0.425     | 0.842/0.626    | 0.763/0.483    | 0.713/0.399

JNET achieved consistently favorable performance for cold-start
users, as the proximity between users is accurately identified
by its user representations learnt from text data. Comparing
across user groups, better performance is achieved for users with
more text documents. Similar results were obtained on the Yelp dataset
as well, but are omitted due to space limits.

4.5 Expert Recommendation

In the sampled StackOverflow dataset, the average number of
answers per question is as low as 1.14, which indicates the difficulty
of getting an expert to answer a question. If the system can
suggest the right user to answer the posted questions, e.g., push the
question to the selected user, more questions would be answered
more quickly and accurately. We conjecture that the learnt topic
distribution of each question in StackOverflow, together with the
identified user representation, can facilitate the task of expert
recommendation for question answering. The task can be further
decomposed into two components: whether the question falls into a
user’s skill set; and whether the user who asked the question shares
similar interests with the potential candidate experts. With the learnt
topic embeddings $\Phi$ and each user’s embedding $u_i$, each user’s
interest over topics can be characterized as a mapping from the topic
embeddings to the user’s embedding, i.e., $\Phi \cdot u_i$. Together
with the learnt topic distribution of each question, we can estimate
the proximity between a question and a user’s expertise to score the
alignment between them. Meanwhile, the closeness between users can
be simply measured by the distance between their corresponding
embedding vectors. As a result, the task can be formalized as finding
the user that achieves the highest relatedness with the given question,
where we define the relatedness as follows:

$$\mathrm{score} = \alpha \cdot \mathrm{cosine}(\Phi \cdot u_i, \theta_{jd}) + (1 - \alpha) \cdot \mathrm{cosine}(u_i, u_j) \qquad (5)$$

where $u_i$ is a candidate expert, $u_j$ is the user who asked the
question, and $\theta_{jd}$ is the topic proportion of the question.

Figure 6: Expert recommendation on StackOverflow.

Due to the limited number of answers for each question in our
dataset, we selected 1,816 questions with more than 2 answers for
the experiment. Besides the users that answered the given question,
we also incorporated irrelevant users for each question for
evaluation purposes; the number of irrelevant users is 10 times
the number of answers. We compared against the learnt topic
distributions of questions and user representations from LDA, HFT
and CTR, as we cannot get the topic distribution of each question
from the other baselines. As we tune the weight between the two
components in Eq (5), we plot the corresponding NDCG and MAP
in Figure 6.
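The relatedness in Eq (5) combines a user-content term and a user-user term through a single weight. A minimal sketch of the computation follows, assuming the topic embeddings are stored as a K x M matrix and user embeddings as length-M vectors; the function names and the toy values are illustrative, not taken from the released code:

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def relatedness(alpha, topic_emb, u_candidate, u_asker, theta_question):
    """Eq (5): weighted sum of user-content and user-user cosine similarity.

    topic_emb:      (K, M) topic embedding matrix.
    u_candidate:    (M,) embedding of the candidate expert.
    u_asker:        (M,) embedding of the user who asked the question.
    theta_question: (K,) topic proportion of the question.
    """
    interest = topic_emb @ u_candidate   # candidate's interest over the K topics
    content_term = cosine(interest, theta_question)
    social_term = cosine(u_candidate, u_asker)
    return alpha * content_term + (1.0 - alpha) * social_term


# Toy setup: the candidate matches the question topic but not the asker.
topic_emb = np.eye(2)
u_candidate = np.array([1.0, 0.0])
u_asker = np.array([0.0, 1.0])
theta_question = np.array([1.0, 0.0])

scores = {a: relatedness(a, topic_emb, u_candidate, u_asker, theta_question)
          for a in (0.0, 0.5, 1.0)}
```

Sweeping the weight over a grid between 0 and 1 and ranking candidates by this score at each value reproduces the kind of curve plotted against the weight.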

JNET achieved very promising performance in this recommendation
task, as it explicitly models a user’s expertise and a given
question in the topic space. The estimated similarities between
user-user and user-content accurately align the question to the
right user. The baseline models can only capture the similarity
between questions and users based on their topical similarity, which
is insufficient in this task. Interestingly, as we gradually increased
the weight of question-content similarity from 0 to 1, JNET’s
performance peaked, which reflects the relative importance of
user-user and user-content similarities for this specific problem.

5 CONCLUSION AND FUTURE WORK 

In this paper, we studied user representation learning by explicitly
modeling the structural dependency among different modalities
of user-generated data. We proposed a complete generative model
named JNET to integrate user representation learning with content
modeling and social network modeling. The learnt user representations
are interpretable and predictive, as indicated by the performance
improvement in many important tasks such as link prediction and
expert recommendation.

Several areas are left open for our future explorations. The current
model focuses on the first-order proximity among users in
network modeling, while higher-order proximity can be explored
to better capture the network connectivity. Also, temporal
information of the text content and connections is not considered in
the current model. By properly utilizing the temporal information,
we would be able to learn the dynamics of user representations,
together with the evolution of topics and the social network.

A VARIATIONAL INFERENCE

In this section, we provide the detailed derivation of the likelihood 
lower bound in Eq (1). 

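Recall that we begin by postulating a factorized distribution:

$$q(\Phi, U, \Delta, \Theta, Z) = \prod_{k=1}^{K} q(\phi_k) \prod_{i=1}^{U} q(u_i) \Big[ \prod_{j \neq i} q(\delta_{ij}) \prod_{d} q(\theta_{id}) \prod_{n} q(z_{idn}) \Big]$$

where the factors have the following parametric forms:

$$q(\phi_k) = \mathcal{N}(\phi_k \mid \mu^{(\phi_k)}, \Sigma^{(\phi_k)}), \quad q(u_i) = \mathcal{N}(u_i \mid \mu^{(u_i)}, \Sigma^{(u_i)}),$$

$$q(\delta_{ij}) = \mathcal{N}(\delta_{ij} \mid \mu^{(\delta_{ij})}, (\sigma^{(\delta_{ij})})^2), \quad q(\theta_{id}) = \mathcal{N}(\theta_{id} \mid \mu^{(\theta_{id})}, \Sigma^{(\theta_{id})}),$$

$$q(z_{idn}) = \mathrm{Mult}(z_{idn} \mid \eta_{idn})$$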

The log likelihood of observed user behaviors, e.g., posted texts
and connected social relations, is bounded from below using
Jensen’s inequality:

$$\begin{aligned}
\log p(w, e \mid \alpha, \beta, \gamma, \tau)
&\geq \mathbb{E}_q[\log p(\Phi, U, \Theta, Z, \Delta, w, e \mid \alpha, \beta, \gamma, \tau)] - \mathbb{E}_q[\log q(\Phi, U, \Theta, Z, \Delta)] \\
&= \mathbb{E}_q[\log p(\Phi \mid \alpha)] + \mathbb{E}_q[\log p(U \mid \gamma)] + \mathbb{E}_q[\log p(\Delta \mid U)] + \mathbb{E}_q[\log p(e \mid \Delta)] \\
&\quad + \mathbb{E}_q[\log p(\Theta \mid U, \Phi, \tau)] + \mathbb{E}_q[\log p(Z \mid \Theta)] + \mathbb{E}_q[\log p(w \mid Z, \beta)] \\
&\quad - \mathbb{E}_q[\log q(\Phi, U, \Delta, \Theta, Z)]
\end{aligned}$$

Thanks to the conjugate priors introduced on $U$ and $\Phi$, the
expectations related to these latent variables are straightforward.
However, the calculations of $\mathbb{E}_q[\log p(e \mid \Delta)]$ and
$\mathbb{E}_q[\log p(Z \mid \Theta)]$ are difficult, since there is no
conjugate prior for the logistic Normal distribution.
We will first provide details of these two nontrivial expectations,
and then list the other terms for reproducibility.

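• Compute $\mathbb{E}_q[\log p(e \mid \Delta)]$. The nonconjugacy of the
logistic normal leads to difficulty in computing the expected
probability of edge assignment between $u_i$ and $u_j$:

$$\mathbb{E}_q[\log p(e_{ij} \mid \delta_{ij})] = \mathbb{E}_q[e_{ij} \delta_{ij}] - \mathbb{E}_q[\log(1 + \exp(\delta_{ij}))]$$

We utilize the inequality of the logarithm $\log x \leq x - 1$, and set
$x = \epsilon^{-1}(1 + \exp(\delta_{ij}))$, to approximate the second term:

$$\log(1 + \exp(\delta_{ij})) \leq \epsilon^{-1}(1 + \exp(\delta_{ij})) - 1 + \log \epsilon$$

Thus, the corresponding expectation is as follows,

$$\mathbb{E}_q[\log(1 + \exp(\delta_{ij}))] \leq \epsilon^{-1} \mathbb{E}_q[1 + \exp(\delta_{ij})] - 1 + \log \epsilon$$

where the expectation is the mean of a log-normal distribution:

$$\mathbb{E}_q[1 + \exp(\delta_{ij})] = 1 + \exp\Big(\mu^{(\delta_{ij})} + \tfrac{1}{2} (\sigma^{(\delta_{ij})})^2\Big)$$

Putting them together, we get the expectation as follows:

$$\mathbb{E}_q[\log p(e_{ij} \mid \delta_{ij})] \geq e_{ij} \mu^{(\delta_{ij})} - \epsilon^{-1}\Big(1 + \exp\big(\mu^{(\delta_{ij})} + \tfrac{1}{2} (\sigma^{(\delta_{ij})})^2\big)\Big) + 1 - \log \epsilon$$

where a new variational parameter $\epsilon$ is introduced, and we set
$\epsilon = 1 + \exp(\delta_{ij})$ to approach the equality.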
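The upper bound on the expected log(1 + exp(.)) term can be checked numerically. The sketch below evaluates the exact Gaussian expectation by Gauss-Hermite quadrature and compares it with the bound for two values of the auxiliary parameter. As an assumption of this sketch, the tight value is taken to be the minimizer of the bound's right-hand side, i.e., the expectation of 1 + exp(.) under the Gaussian; the function names and toy values are hypothetical:

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss


def expected_softplus(mu, sigma, degree=40):
    """E[log(1 + exp(delta))] for delta ~ N(mu, sigma^2) via quadrature."""
    nodes, weights = hermegauss(degree)              # weight function exp(-x^2 / 2)
    values = np.logaddexp(0.0, mu + sigma * nodes)   # stable log(1 + exp(.))
    return float(np.sum(weights * values) / np.sqrt(2.0 * np.pi))


def softplus_bound(mu, sigma, eps):
    """Right-hand side: eps^-1 * E[1 + exp(delta)] - 1 + log(eps)."""
    mean_term = 1.0 + np.exp(mu + 0.5 * sigma ** 2)  # 1 + mean of a log-normal
    return float(mean_term / eps - 1.0 + np.log(eps))


mu, sigma = 0.3, 0.8                                 # toy variational parameters
exact = expected_softplus(mu, sigma)
eps_star = 1.0 + np.exp(mu + 0.5 * sigma ** 2)       # minimizer of the bound over eps
tight = softplus_bound(mu, sigma, eps_star)
loose = softplus_bound(mu, sigma, 1.0)
```

At the minimizer the bound collapses to the logarithm of the expectation, which by Jensen's inequality still upper-bounds the exact value; any other choice of the auxiliary parameter gives a looser bound.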

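• Compute $\mathbb{E}_q[\log p(Z \mid \Theta)]$. The nonconjugacy of the
logistic normal also exists in computing the expectation of topic
assignment for each word of $u_i$’s $d$-th document:

$$\mathbb{E}_q[\log p(z_{idn} \mid \theta_{id})] = \mathbb{E}_q[\theta_{id}^{\top} z_{idn}] - \mathbb{E}_q\Big[\log\Big(\sum_{k=1}^{K} \exp(\theta_{idk})\Big)\Big]$$

We again utilize the inequality of the logarithm $\log x \leq x - 1$, and
set $x = \zeta^{-1} \sum_{k=1}^{K} \exp(\theta_{idk})$ to bound the second term:

$$\log \sum_{k=1}^{K} \exp(\theta_{idk}) \leq \zeta^{-1} \sum_{k=1}^{K} \exp(\theta_{idk}) - 1 + \log \zeta$$

Thus the second term is bounded as:

$$\mathbb{E}_q\Big[\log\Big(\sum_{k=1}^{K} \exp(\theta_{idk})\Big)\Big] \leq \zeta^{-1} \sum_{k=1}^{K} \mathbb{E}_q[\exp(\theta_{idk})] - 1 + \log \zeta$$

where the expectation is the mean of a log-normal distribution:

$$\mathbb{E}_q[\exp(\theta_{idk})] = \exp\Big(\mu_k^{(\theta_{id})} + \tfrac{1}{2} \Sigma_{kk}^{(\theta_{id})}\Big)$$

Putting them together, we get the expectation as follows:

$$\mathbb{E}_q[\log p(z_{idn} \mid \theta_{id})] \geq \mu^{(\theta_{id})\top} \eta_{idn} - \zeta^{-1} \sum_{k=1}^{K} \exp\Big(\mu_k^{(\theta_{id})} + \tfrac{1}{2} \Sigma_{kk}^{(\theta_{id})}\Big) + 1 - \log \zeta$$

where another variational parameter $\zeta$ is introduced, and we set
$\zeta = \sum_{k=1}^{K} \exp(\theta_{idk})$ to approach the equality.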

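• Compute $\mathbb{E}_q[\log p(\Phi \mid \alpha)]$. Topic embeddings follow
Gaussian distributions under both $p$ and $q$, and the corresponding
expectation is:

$$\mathbb{E}_q[\log p(\phi_k \mid \alpha)] \propto \frac{M}{2} \log \alpha - \frac{\alpha}{2} \Big[ \sum_{m=1}^{M} \Sigma_{mm}^{(\phi_k)} + \mu^{(\phi_k)\top} \mu^{(\phi_k)} \Big]$$

• Compute $\mathbb{E}_q[\log p(U \mid \gamma)]$. User embeddings also follow
Gaussian distributions, thus:

$$\mathbb{E}_q[\log p(u_i \mid \gamma)] \propto \frac{M}{2} \log \gamma - \frac{\gamma}{2} \Big[ \sum_{m=1}^{M} \Sigma_{mm}^{(u_i)} + \mu^{(u_i)\top} \mu^{(u_i)} \Big]$$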

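• Compute $\mathbb{E}_q[\log p(\Delta \mid U)]$. The affinity $\delta_{ij}$
between a pair of users $u_i$ and $u_j$ follows a Gaussian distribution.
Thus, the corresponding expectation can be written as follows:

$$\begin{aligned}
\mathbb{E}_q[\log p(\delta_{ij} \mid u_i, u_j)]
&\propto -\log \xi - \frac{1}{2\xi^2} \mathbb{E}_q[(\delta_{ij} - u_i^{\top} u_j)^2] \\
&= -\log \xi - \frac{1}{2\xi^2} \mathbb{E}_q[\delta_{ij}^2] + \frac{1}{\xi^2} \mathbb{E}_q[u_i^{\top} u_j] \, \mathbb{E}_q[\delta_{ij}] - \frac{1}{2\xi^2} \mathbb{E}_q[(u_i^{\top} u_j)^2]
\end{aligned}$$

where the expectations under the Gaussian distributions of $\delta$ and
$u$ can be directly written, thus we get:

$$\begin{aligned}
\mathbb{E}_q[\log p(\delta_{ij} \mid u_i, u_j)]
&\propto -\log \xi - \frac{1}{2\xi^2} \Big( (\mu^{(\delta_{ij})})^2 + (\sigma^{(\delta_{ij})})^2 \Big) + \frac{1}{\xi^2} \mu^{(\delta_{ij})} \mu^{(u_i)\top} \mu^{(u_j)} \\
&\quad - \frac{1}{2\xi^2} \mathrm{tr}\Big[ \big(\Sigma^{(u_i)} + \mu^{(u_i)} \mu^{(u_i)\top}\big)^{\top} \big(\Sigma^{(u_j)} + \mu^{(u_j)} \mu^{(u_j)\top}\big) \Big]
\end{aligned}$$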


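• Compute $\mathbb{E}_q[\log p(\Theta \mid U, \Phi, \tau)]$. The topic
proportion of each user’s document follows a Gaussian distribution. The
corresponding expectation can be written as:

$$\mathbb{E}_q[\log p(\theta_{id} \mid u_i, \Phi, \tau)] \propto \frac{K}{2} \log \tau - \frac{\tau}{2} \Big\{ \mathbb{E}_q[\theta_{id}^{\top} \theta_{id}] - \mathbb{E}_q[\theta_{id}^{\top} \Phi u_i] - \mathbb{E}_q[u_i^{\top} \Phi^{\top} \theta_{id}] + \mathbb{E}_q[u_i^{\top} \Phi^{\top} \Phi u_i] \Big\}$$

where $\Phi$ and $u$ also follow Gaussian distributions and the
calculation is straightforward, thus we get:

$$\begin{aligned}
\mathbb{E}_q[\log p(\theta_{id} \mid u_i, \Phi, \tau)]
&\propto \frac{K}{2} \log \tau - \frac{\tau}{2} \Big[ \sum_{k=1}^{K} \Sigma_{kk}^{(\theta_{id})} + \mu^{(\theta_{id})\top} \mu^{(\theta_{id})} \Big] + \tau \sum_{k=1}^{K} \mu_k^{(\theta_{id})} \mu^{(\phi_k)\top} \mu^{(u_i)} \\
&\quad - \frac{\tau}{2} \sum_{k=1}^{K} \mathrm{tr}\Big[ \big(\Sigma^{(u_i)} + \mu^{(u_i)} \mu^{(u_i)\top}\big)^{\top} \big(\Sigma^{(\phi_k)} + \mu^{(\phi_k)} \mu^{(\phi_k)\top}\big) \Big]
\end{aligned}$$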

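• Compute $\mathbb{E}_q[\log p(w \mid Z, \beta)]$. The expectation of word
assignment is given by:

$$\mathbb{E}_q[\log p(w_{idn} \mid z_{idn}, \beta)] = \sum_{k=1}^{K} \sum_{v=1}^{V} \mathbb{E}_q[z_{idnk}] \, w_{idnv} \log \beta_{kv} = \sum_{k=1}^{K} \sum_{v=1}^{V} \eta_{idnk} \, w_{idnv} \log \beta_{kv}$$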

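B PARAMETER ESTIMATION

By taking the gradient of $\mathcal{L}(q)$ in Eq (1) with respect to the
variance of user affinity, $\xi^2$, and setting it to 0, we get the
closed form estimation as follows:

$$\xi^2 = [U(U - 1)]^{-1} \sum_{i=1}^{U} \sum_{j \neq i} \Big[ (\mu^{(\delta_{ij})})^2 + (\sigma^{(\delta_{ij})})^2 - 2 \mu^{(\delta_{ij})} \mu^{(u_i)\top} \mu^{(u_j)} + \mathrm{tr}\Big( \big(\Sigma^{(u_i)} + \mu^{(u_i)} \mu^{(u_i)\top}\big)^{\top} \big(\Sigma^{(u_j)} + \mu^{(u_j)} \mu^{(u_j)\top}\big) \Big) \Big]$$

Similarly, the closed form estimation for the variance of document
topic proportion, $\tau^{-1}$, is given by:

$$\tau^{-1} = \Big[ K \sum_{i=1}^{U} D_i \Big]^{-1} \sum_{i=1}^{U} \sum_{d=1}^{D_i} \Big[ \sum_{k=1}^{K} \Sigma_{kk}^{(\theta_{id})} + \mu^{(\theta_{id})\top} \mu^{(\theta_{id})} - \sum_{k=1}^{K} \Big( 2 \mu_k^{(\theta_{id})} \mu^{(\phi_k)\top} \mu^{(u_i)} - \mathrm{tr}\Big( \big(\Sigma^{(u_i)} + \mu^{(u_i)} \mu^{(u_i)\top}\big)^{\top} \big(\Sigma^{(\phi_k)} + \mu^{(\phi_k)} \mu^{(\phi_k)\top}\big) \Big) \Big) \Big]$$

where $D_i$ denotes the number of documents of user $u_i$.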
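The closed form for the affinity variance is the average, over ordered user pairs, of the expected squared residual between the affinity and the inner product of the two user embeddings under the variational distribution. A minimal NumPy sketch, assuming dense arrays of variational means and covariances (the array layout, function names and toy values are hypothetical):

```python
import numpy as np


def expected_sq_residual(mu_delta, var_delta, mu_i, cov_i, mu_j, cov_j):
    """E_q[(delta_ij - u_i^T u_j)^2] under independent Gaussian factors."""
    second_i = cov_i + np.outer(mu_i, mu_i)   # E_q[u_i u_i^T]
    second_j = cov_j + np.outer(mu_j, mu_j)   # E_q[u_j u_j^T]
    return (mu_delta ** 2 + var_delta
            - 2.0 * mu_delta * (mu_i @ mu_j)
            + np.trace(second_i.T @ second_j))


def estimate_xi_squared(mu_delta, var_delta, mu_u, cov_u):
    """Average expected squared residual over all ordered pairs i != j.

    mu_delta, var_delta: (U, U) arrays; mu_u: (U, M); cov_u: (U, M, M).
    """
    num_users = mu_u.shape[0]
    total = 0.0
    for i in range(num_users):
        for j in range(num_users):
            if i != j:
                total += expected_sq_residual(
                    mu_delta[i, j], var_delta[i, j],
                    mu_u[i], cov_u[i], mu_u[j], cov_u[j])
    return total / (num_users * (num_users - 1))


# Toy check with zero variances: the estimate reduces to the plain squared
# residual (mu_delta - mu_i^T mu_j)^2 = (4 - 5)^2 = 1 for both ordered pairs.
mu_u = np.array([[1.0, 2.0], [3.0, 1.0]])
cov_u = np.zeros((2, 2, 2))
mu_delta = np.array([[0.0, 4.0], [4.0, 0.0]])
var_delta = np.zeros((2, 2))
xi_squared = estimate_xi_squared(mu_delta, var_delta, mu_u, cov_u)
```

The double loop is written for clarity; on real data the pairwise terms would normally be vectorized or restricted to the pairs actually modeled.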



